Table of Contents

  • 1  Archives and DataSets

  • 2  Archive Format

    • 2.1  Basic Usage

    • 2.2  Large Arrays

    • 2.3  Importable Archives

  • 3  Archive Details

    • 3.1  Single-item Archives

    • 3.2  Containers, Duplicates, and Circular References

  • 4  Archive Examples

    • 4.1  Non-scoped (flat) Format

    • 4.2  Scoped Format

  • 5  DataSet Format

  • 6  DataSet Examples

Archives and DataSets

There are two main classes provided by the persist module: persist.Archive and persist.DataSet.

Archives deal with the linkage between objects so that objects referred to in multiple places are stored only once in the archive. Archives provide two main methods to serialize the data:

  1. Via the str() operator, which returns a string that can be executed to restore the archive.
  2. Via the Archive.save() method, which exports the archive as an importable python package or module.

DataSets use archives to provide storage for multiple sets of data along with associated metadata. Each set of data is designed for concurrent access, synchronized with locks.

Archive Format

The persist.Archive object maintains a collection of python objects that are inserted with persist.Archive.insert(). This can be serialized to a string and then reconstituted through evaluation:

Basic Usage

[1]:
from persist.archive import Archive

a = 1
x = range(2)
y = range(3)   # Implicitly referenced in archive
b = [x, y, y]  # Nested references to x and y

# scoped=False is prettier, but slower and not as safe
archive = Archive(scoped=False)
archive.insert(a=a, x=x, b=b)

# Get the string representation
s = str(archive)
print(s)
from builtins import range as _range
_g3 = _range(0, 3)
x = _range(0, 2)
b = [x, _g3, _g3]
a = 1
del _range
del _g3
try: del __builtins__, _arrays
except NameError: pass
[2]:
d = {}
exec(s, d)
print(d)
assert d['a'] == a
assert d['x'] == x
assert d['b'] == b
assert d['b'][1] is d['b'][2]  # Note: these are the same object
{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}

Large Arrays

If you have large arrays of data, then it is better to store them externally. To do this, set array_threshold to specify the maximum number of elements to store in an inline array. Any larger array will be stored in Archive.data and will not be included in the string representation. To properly reconstitute the archive, this data must be provided in the environment as a dictionary under the name given by Archive.data_name, which defaults to _arrays:

[3]:
import os.path, tempfile, shutil, numpy as np
from persist.archive import Archive

a = 1
x = np.arange(10)
y = np.arange(20)  # Implicitly referenced in archive
b = [x, y]

archive = Archive(scoped=False, array_threshold=5)
archive.insert(a=a, x=x, b=b)

# Get the string representation
s = str(archive)
print(s)
print(archive.data)
x = _arrays['array_0']
b = [x, _arrays['array_1']]
a = 1
try: del __builtins__, _arrays
except NameError: pass
{'array_0': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'array_1': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])}

To evaluate the string representation, we need to provide the _arrays dictionary:

[4]:
d = dict(_arrays=archive.data)
exec(s, d)
print(d)
{'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'b': [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])], 'a': 1}
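
If the _arrays dictionary is omitted, evaluation fails as soon as the first external array is referenced, which makes the dependence explicit. A quick check (the exact error message may vary by python version):

try:
    exec(s, {})
except NameError as err:
    print(err)  # name '_arrays' is not defined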

To store the data, use Archive.save_data():

[5]:
import os.path, tempfile
tmpdir = tempfile.mkdtemp()  # Make temporary directory for data
datafile = os.path.join(tmpdir, 'arrays')
archive.save_data(datafile=datafile)
print(tmpdir)
!ls $tmpdir/arrays
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp5jutcwlt
array_0.npy array_1.npy
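
To reconstitute the archive from disk later, the saved arrays can be loaded back into a dictionary by hand. Here is a minimal sketch, assuming the directory of .npy files shown above and that it is run before tmpdir is removed; the load_arrays helper is hypothetical, not part of persist:

import os
import numpy as np

def load_arrays(datafile):
    # Hypothetical helper: map filename stems like 'array_0' back to
    # the arrays written by Archive.save_data().
    return {os.path.splitext(f)[0]: np.load(os.path.join(datafile, f))
            for f in os.listdir(datafile) if f.endswith('.npy')}

d = dict(_arrays=load_arrays(datafile))
exec(s, d)
assert np.allclose(d['x'], x)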

Importable Archives

(New in version 1.0)

Archives can be saved as importable packages using the save() method. This writes the representable portion of the archive as an importable module, with additional code to load any external arrays. Archives can be saved either as a full package (a directory with a <name>/__init__.py file, etc.) or as a single <name>.py module. These can be imported without the persist package:

[6]:
import os.path, sys, tempfile, shutil, numpy as np
from persist.archive import Archive

tmpdir = tempfile.mkdtemp()

a = 1
x = np.arange(10)
y = np.arange(20)  # Implicitly referenced in archive
b = [x, y]

archive = Archive(array_threshold=5)
archive.insert(a=a, x=x, b=b)
archive.save(dirname=tmpdir, name='mod1', package=True)
archive.save(dirname=tmpdir, name='mod2', package=False)

!tree $tmpdir

sys.path.append(tmpdir)
import mod1, mod2
sys.path.pop()
for mod in [mod1, mod2]:
    assert mod.a == a and np.allclose(mod.x, x)
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp9ahg71yq
|-- mod1
|   |-- __init__.py
|   `-- _arrays
|       |-- array_0.npy
|       `-- array_1.npy
|-- mod2.py
`-- mod2_arrays
    |-- array_0.npy
    `-- array_1.npy

3 directories, 6 files

Archive Details

Single-item Archives

(New in version 1.0)

If an archive contains a single item, then the representation can be simplified so that importing the module results in the actual object. This is mainly for use in DataSets, where it allows large objects to live in a module that is only loaded if explicitly imported. In this case, one can also omit the name when calling Archive.save(), as it defaults to the name of the single item.

[7]:
import os.path, sys, tempfile, shutil, numpy as np
from persist.archive import Archive

tmpdir = tempfile.mkdtemp()

x = np.arange(10)
y = np.arange(20)  # Implicitly referenced in archive
b = [x, y]

archive = Archive(single_item_mode=True, array_threshold=5)
archive.insert(b1=b)
archive.save(dirname=tmpdir, package=True)

archive = Archive(scoped=False, single_item_mode=True, array_threshold=5)
archive.insert(b2=b)
archive.save(dirname=tmpdir, package=False)

!tree $tmpdir

sys.path.append(tmpdir)
import b1, b2
sys.path.pop()
for b_ in [b1, b2]:
    assert np.allclose(b_[0], x) and np.allclose(b_[1], y)
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmpval74y_i
|-- b1
|   |-- __init__.py
|   `-- _arrays
|       |-- array_0.npy
|       `-- array_1.npy
|-- b2.py
`-- b2_arrays
    |-- array_0.npy
    `-- array_1.npy

3 directories, 6 files

Note what is happening here: although we explicitly import b1, the result is that b1 = [x, y] is the list itself rather than a module. This behaviour is somewhat of an abuse of the import system, so it should not be overused. The use in DataSet is that these modules are included as submodules of the DataSet package, acting as attributes of the top-level package, but only being loaded when explicitly imported, to limit memory usage.
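
To make this explicit, right after the import in the cell above (while tmpdir is still on sys.path) one could check:

import b1
print(type(b1))            # <class 'list'>: the import produced the list, not a module
assert isinstance(b1, list)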

Containers, Duplicates, and Circular References

The main complexity with archives is with objects like lists and dictionaries that refer to other objects: all objects referenced by such “containers” need to be stored only once in the archive. A current limitation is that circular dependencies cannot be resolved. The pickling mechanism provides a way to restore circular dependencies, but I do not see an easy way to do this in a human-readable format, so the current requirement is that the references in an object form a directed acyclic graph (DAG).
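
For example, an archive containing a cycle cannot be serialized. A minimal sketch; the exact exception raised is an implementation detail, so it is caught broadly here:

from persist.archive import Archive

l1 = []
l2 = [l1]
l1.append(l2)          # l1 and l2 now refer to each other: a cycle

archive = Archive()
try:
    archive.insert(l1=l1)
    str(archive)       # the references do not form a DAG
except Exception as err:
    print("cannot archive cycle:", err)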

Archive Examples

Here we demonstrate simple archives containing all of the data inline.

Non-scoped (flat) Format

We start with the scoped=False format. This produces a flat archive that is easier to read:

[8]:
import os.path, tempfile, shutil
from persist.archive import Archive

a = 1
x = range(2)
y = range(3)   # Implicitly referenced in archive
b = [x, y, y]  # Nested references to x and y

archive = Archive(scoped=False)
archive.insert(a=a, x=x, b=b)

# Get the string representation
%time s = str(archive)
print(s)
CPU times: user 778 µs, sys: 210 µs, total: 988 µs
Wall time: 927 µs
from builtins import range as _range
_g3 = _range(0, 3)
x = _range(0, 2)
b = [x, _g3, _g3]
a = 1
del _range
del _g3
try: del __builtins__, _arrays
except NameError: pass

Note that intermediate objects not explicitly inserted are stored in variables like _g#, and that these are deleted at the end, so evaluating the string in a dictionary gives a clean result:

[9]:
# Now execute the representation to get the data
d = {}
exec(s, d)
print(d)
d['b'][1] is d['b'][2]
{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}
[9]:
True

The potential problem with the flat format is that, to obtain this simple representation, a graph reduction is performed that replaces intermediate nodes, ensuring both that local variables do not have name clashes and that the representation is simplified. Replacing variables in representations can have performance implications if the objects are large. The fastest approach is a plain string replacement, but this can make mistakes if the substring happens to appear in the data. The robust_replace option instead invokes the python AST parser, which is safer but slower.
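
To see why a plain string replacement can be unsafe, consider this illustration (not persist's actual code). If a variable name happens to occur inside string data, a naive textual substitution corrupts it:

rep = "b = [_g3, 'the label _g3 appears in this string']"
print(rep.replace('_g3', '_range(0, 3)'))
# b = [_range(0, 3), 'the label _range(0, 3) appears in this string']  # data corrupted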

Scoped Format

To alleviate these issues, the scoped=True format is provided. This is visually more complicated, as each object is constructed in a function. The advantage is that this provides a local scope in which each object is defined: any local variables used in the representation of an object can be kept as they are, without worrying that they will conflict with other names in the file. (Note below how shared objects are passed in as default arguments such as _l_0 and _l_1.) No reduction is performed and no replacements are made, making the method faster and more robust, but less attractive if the files need to be inspected by humans:

[10]:
archive = Archive(scoped=True)
archive.insert(a=a, x=x, b=b)

# Get the string representation
%time s = str(archive)
print(s)
CPU times: user 412 µs, sys: 22 µs, total: 434 µs
Wall time: 452 µs

def _g3():
    from builtins import range
    return range(0, 3)
_g3 = _g3()

def x():
    from builtins import range
    return range(0, 2)
x = x()

def b(_l_0=x,_l_1=_g3,_l_2=_g3):
    return [_l_0, _l_1, _l_2]
b = b()
a = 1
del _g3
try: del __builtins__, _arrays
except NameError: pass

DataSet Format

This is the new format of DataSets starting with version 1.0.

A DataSet is a directory with the following files:

  • _this_dir_is_a_DataSet: This is an empty file signifying that the directory is a DataSet.
  • __init__.py: Each DataSet is an importable python module so that the data can be used on a machine without the persist package. This file contains the following variable:
      • _info_dict: This is a dictionary/namespace with string keys (which must be valid python identifiers) and associated data (which should in general be small). These are intended to be interpreted as metadata.

For the remainder of this discussion, we shall assume that _info_dict contains the key 'x'.

  • x.py: This is the python file responsible for loading the data associated with the key 'x' in _info_dict. If the size of the array is less than the array_threshold specified in the DataSet object, then the array data is stored inline in this file; otherwise, this file is responsible for loading the data from an associated file.
  • x_data.*: If the size of the array stored in x is larger than array_threshold, then the data associated with x is stored in this file/directory, which may be an HDF5 file or a numpy array file.

These DataSet modules can be imported directly. Importing the top-level DataSet results in a module whose _info_dict attribute contains all the metadata. The data items become available when you explicitly import them.

DataSet Examples

[11]:
import os.path, pprint, sys, tempfile, shutil, numpy as np
from persist.archive import DataSet
tmpdir = tempfile.mkdtemp()  # Make temporary directory for dataset
print("Storing dataset in {}".format(tmpdir))

a = np.arange(10)
x = np.arange(15)

ds = DataSet('dataset', 'w', path=tmpdir, array_threshold=12, data_format='npy')

ds.a = a
ds.x = [a, x]
ds['a'] = "A small array"
ds['x'] = "A list with a small and large array"

!tree $tmpdir

del ds

ds = DataSet('dataset', 'r', path=tmpdir)
print(ds['a'])
print(ds['x'])
print(ds.a)   # The arrays a and x are not actually loaded until here
print(ds.x)
Storing dataset in /var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6
`-- dataset
    |-- __init__.py
    |-- _this_dir_is_a_DataSet
    |-- a.py
    |-- x.py
    `-- x_data
        `-- array_0.npy

2 directories, 5 files
A small array
A list with a small and large array
[0 1 2 3 4 5 6 7 8 9]
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])]

As an alternative, you can import the dataset directly, without needing the persist library. This also provides delayed loading:

[12]:
sys.path.append(tmpdir)
import dataset   # Only imports _info_dict at this point
print('import dataset: The dataset module initially contains')
print(dir(dataset))

import dataset.a, dataset.x   # Now we get a and x
print('import dataset.a, dataset.x: The dataset module now contains')
print(dir(dataset))

sys.path.pop()

shutil.rmtree(tmpdir)        # Remove files
import dataset: The dataset module initially contains
['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict']
import dataset.a, dataset.x: The dataset module now contains
['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict', 'a', 'x']